[tune] Better error message for Tune nested tasks / actors #25241

Merged
merged 3 commits on Jun 3, 2022

Conversation

@ericl (Contributor) commented May 26, 2022

Why are these changes needed?

This PR uses a task/actor launch hook to generate better error messages for nested Tune tasks/actors when no extra resources are reserved for them. The idea is that the Tune trial runner actor sets a hook prior to executing the user code. If the user code launches a task and the placement group for the trial cannot possibly fit it, we raise a TuneError right away to warn the user.

This isn't perfect, but it should catch the most common cases, and we can also improve the docs to help cover this footgun.
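
The hook mechanism can be sketched roughly as follows; the names and signatures here are illustrative only and do not match the actual Ray/Tune internals:

# Illustrative sketch of the launch-hook idea (not the real Ray internals).

def make_resource_check_hook(trial_bundles):
    """Build a hook that the trial runner installs before running user code.

    `trial_bundles` is the list of resource bundles reserved for the trial's
    placement group, e.g. [{"CPU": 1.0}]. The first bundle is consumed by the
    trainable itself, so nested tasks/actors must fit into the remaining ones.
    """

    def on_task_submitted(task_name, requested_resources):
        spare_bundles = trial_bundles[1:]
        for bundle in spare_bundles:
            if all(
                bundle.get(key, 0.0) >= amount
                for key, amount in requested_resources.items()
            ):
                return  # the nested task can fit into a reserved slot
        raise RuntimeError(
            f"No trial resources are available for executing the task `{task_name}`."
        )

    return on_task_submitted


# Example: a trial that reserved only [{"CPU": 1.0}] has no spare slot,
# so a nested 1-CPU task can never fit and the hook raises.
hook = make_resource_check_hook([{"CPU": 1.0}])
hook("x.foo", {"CPU": 1.0})  # raises RuntimeError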

Sample code:

import ray
from ray import tune

@ray.remote
def foo():
    pass
   
def objective(*args):
    ray.get(foo.remote())

tune.run(objective)

Sample output:

ray.tune.error.TuneError: No trial resources are available for executing the task `x.foo`. To
resolve this, specify the Tune option:

>  resources_per_trial=tune.PlacementGroupFactory(
>    [{'CPU': 1.0}] + [{'CPU': 1.0}] * N
>  )

Where `N` is the number of slots to reserve for trial tasks. For more information, refer to:
https://docs.ray.io/en/latest/tune/tutorials/tune-resources.html
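
Applied to the sample above, the suggested fix looks roughly like this. tune.PlacementGroupFactory is the API named in the message; N = 2 is just an example value, and the right bundle sizes depend on the workload:

import ray
from ray import tune

@ray.remote
def foo():
    pass

def objective(*args):
    ray.get(foo.remote())

# Reserve one 1-CPU bundle for the trainable itself plus two extra 1-CPU
# bundles (N = 2) for the tasks it launches.
tune.run(
    objective,
    resources_per_trial=tune.PlacementGroupFactory(
        [{"CPU": 1.0}] + [{"CPU": 1.0}] * 2
    ),
)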

Related issue number

Closes #25239

raise AbortTrialExecution(
    f"Your trial launched the Ray task `{name}`, but no extra CPUs were "
    "reserved for task execution. Specify the Tune argument "
    '`per_trial_task_resources={"cpu": N}` to reserve CPUs for task execution.')
ericl (Contributor Author):

The user could also specify the placement group factory.

Contributor:

We should settle on whether we want PlacementGroupFactory (PGF) to be an end-user API. It's pretty unintuitive IMO.

python/ray/remote_function.py: resolved review thread (outdated)
@ericl changed the title from "[WIP / RFC] Better error message for Tune nested tasks / actors" to "[tune] Better error message for Tune nested tasks / actors" on May 28, 2022
@ericl added the tests-ok label (the tagger certifies test failures are unrelated and assumes personal liability) on May 28, 2022
],
)
ray.get(placement_group.ready())
# Tune launches its trainable functions in placement groups.
Contributor:

Hmm, would this be a bit of a jump for users, since we're suddenly talking about using datasets in the context of Tune? Should we add some context on why this is an important concept to understand, e.g. resource contention, hanging, etc.?

ericl (Contributor Author):

I updated the text around this section to clarify this.
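
For context, the excerpt above shows the tail of a docs example about reserving placement group bundles for a trial. A self-contained version of that pattern might look like the following sketch (illustrative, not the exact docs code):

import ray
from ray.util.placement_group import placement_group

ray.init()

# Reserve two 1-CPU bundles: one for the trainable function and one spare
# slot for any task or actor the trainable launches. Tune sets up a
# placement group like this for every trial.
pg = placement_group([{"CPU": 1}, {"CPU": 1}])

# Block until the cluster has actually reserved the requested bundles.
ray.get(pg.ready())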

# Setup hooks for generating placement group resource deadlock warnings.
from ray import actor, remote_function

if "TUNE_DISABLE_RESOURCE_CHECKS" not in os.environ:
Contributor:

Do we want to add this to doc/source/tune/api_docs/env.rst? Or do we intend this to be more of an internal feature flag?

Contributor:

It would be great if you could add this to the docs.

Also:

Suggested change
- if "TUNE_DISABLE_RESOURCE_CHECKS" not in os.environ:
+ if os.environ.get("TUNE_DISABLE_RESOURCE_CHECKS", "0") != "1":

to follow the convention of other Tune env variables.

ericl (Contributor Author):

I'd like to keep this as a hidden, undocumented internal flag. It should only be used for internal debugging.

Contributor:

Ok with me
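
For reference, disabling the checks for internal debugging then just means setting the (undocumented) environment variable before Tune runs, e.g.:

import os

# Internal debugging only: skip the nested task/actor resource checks.
# Setting the value to "1" disables the checks under either form of the
# guard discussed above.
os.environ["TUNE_DISABLE_RESOURCE_CHECKS"] = "1"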

)


def test_scheduling_strategy_override():
Contributor:

Hmmm I got lost here. Can you add a few comments about what this is trying to capture?

ericl (Contributor Author):

Done.

resources = {k: float(v) for k, v in resources.items() if v > 0}

raise TuneError(
    f"No trial resources are available for executing the task `{name}`. "
Contributor:

launching the {submitted}: {name}?

ericl (Contributor Author):

Fixed.

python/ray/tune/session.py: resolved review thread (outdated)
@krfricke (Contributor) left a comment:

This is great, thanks, only minor nits/suggestions

python/ray/tune/session.py: resolved review thread (outdated)
f"> [{main_resources}] + [{resources}] * N\n"
"> )\n\n"
f"Where `N` is the number of slots to reserve for trial {submitted}s. "
"For more information, refer to: https://docs.ray.io/en/latest/tune/"
Contributor:

Suggestion for wording:

Suggested change
- "For more information, refer to: https://docs.ray.io/en/latest/tune/"
+ "If you are using a Ray training library, there might be a utility "
+ "function to set this automatically for you. "
+ "For more information, refer to: https://docs.ray.io/en/latest/tune/"

ericl (Contributor Author):

Fixed.
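
For reference, a rough reconstruction of the assembled message from the fragments quoted in this review and the sample output in the PR description (the helper name below is hypothetical, not the actual Tune function):

def build_insufficient_resources_error(submitted, name, main_resources, resources):
    # `submitted` is "task" or "actor", `name` is the remote function/class,
    # `main_resources` is the trial's own bundle, and `resources` is what the
    # nested task or actor requested.
    return (
        f"No trial resources are available for executing the {submitted} "
        f"`{name}`. To resolve this, specify the Tune option:\n\n"
        ">  resources_per_trial=tune.PlacementGroupFactory(\n"
        f">    [{main_resources}] + [{resources}] * N\n"
        ">  )\n\n"
        f"Where `N` is the number of slots to reserve for trial {submitted}s. "
        "If you are using a Ray training library, there might be a utility "
        "function to set this automatically for you. "
        "For more information, refer to: "
        "https://docs.ray.io/en/latest/tune/tutorials/tune-resources.html"
    )


print(build_insufficient_resources_error("task", "x.foo", {"CPU": 1.0}, {"CPU": 1.0}))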

@ericl requested a review from jianoaix as a code owner on Jun 1, 2022, 20:51
@ericl removed the tests-ok label on Jun 1, 2022
@ericl added the tests-ok label on Jun 2, 2022
@ericl removed the tests-ok label on Jun 3, 2022
@krfricke (Contributor) left a comment:

Thanks!

@@ -248,7 +248,7 @@ def read_datasource(
    )

    if len(read_tasks) < parallelism and (
-       len(read_tasks) < ray.available_resources().get("CPU", parallelism) // 2
+       len(read_tasks) < ray.available_resources().get("CPU", 1) // 2
@ericl (Contributor Author) commented on Jun 3, 2022:

If the CPU count is 0 or missing, fall back to 1 instead of `parallelism`, which is too large.
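
To illustrate the change with concrete (made-up) numbers:

# Illustrative values only: suppose the requested parallelism is 200 and the
# cluster reports no available CPUs yet.
available = {}          # what ray.available_resources() might return
parallelism = 200

old_threshold = available.get("CPU", parallelism) // 2  # 200 // 2 = 100
new_threshold = available.get("CPU", 1) // 2            # 1 // 2 = 0

# The old fallback makes the comparison threshold large, so the guarded
# branch can trigger even when no CPU information is available; falling
# back to 1 keeps the threshold conservative in that case.
print(old_threshold, new_threshold)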

Successfully merging this pull request may close these issues.

[tune / AIR] Far too easy to cause a hang in Ray tune with nested tasks
7 participants